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ABSTRACT 


Nanowires  and  nanotubes  are  promising  building 
blocks  for  designing  nanoscale  devices  for  sensing  war¬ 
fare  agents.  Computer  models  are  key  to  designing  and 
improving  such  sensors.  Due  to  their  nanoscale  size, 
quantum  models  are  needed  to  model  the  transport  of 
electrons  in  this  type  of  device.  Few  methods  exist 
to  accurately  calculate  quantum  transport  for  systems 
comprising  hundreds  of  thousands  of  atoms.  The  ap¬ 
proach  we  are  taking  is  based  on  the  non  equilibrium 
Green’s  function  approach [1,  2,  3]  (NEGF)  and  is  an 
exact  order  N  method  to  calculate  the  charge  den¬ 
sity  and  the  IV  characteristic  of  a  device.  We  are 
presenting  a  novel  numerical  technique  which  allows 
computing  in  parallel  the  Green’s  function.  The  com¬ 
putational  cost  scales  linearly  with  the  number  atoms 
and  the  parallel  efficiency  on  benchmarks  problems  is 
nearly  optimal.  Based  on  our  benchmarks  results  this 
method  will  enable  modeling  devices  of  unprecedented 
size. 


1  INTRODUCTION 


In  NEGF,  we  need  to  consistently  solve  for  the  eigen¬ 
vectors  of  the  single-particle  Schrodinger  Hamiltonian 

H: 

(H  -  ES)c  =  0 

(where  S  is  the  so-called  overlap  matrix)  and  a  Poisson 
equation  for  the  electrostatic  potential.  Computation¬ 
ally,  the  most  expensive  step  is  obtaining  the  retarded 
and  less-than  Green’s  functions  defined  by: 

Gr  =  (H  —  cS)-1,  and  G<  =  GrT  [Gr]*,  (1) 

where  e  =  E  +  ir],  with  r]  infinitesimal,  and  T  is  the 
complex  broadening  matrix  and  can  be  assumed  to 
have  the  same  non-zero  sparsity  pattern  as  H.  The 
infinitesimal  rj  is  necessary  in  order  to  specify  the 


boundary  conditions.  Because  of  the  requirement  that 
the  Schrodinger  equation  is  solved  consistently  with 
the  Poisson  equation  (which  requires  converging  the 
charge  density)  Equation  (1)  must  be  solved  repeat¬ 
edly  at  all  relevant  energy  levels  until  convergence.  A 
key  aspect  of  this  calculation  is  that  only  the  diago¬ 
nal  entries  of  Gr  and  G<  are  actually  needed  to  com¬ 
pute  the  charge  density.  Since  the  size  of  the  matrices 
scales  linearly  with  the  number  of  atoms,  this  problem 
is  computationally  very  expensive  and  even  on  large 
parallel  computers  it  is  difficult  to  solve  problems  with 
more  than  a  few  hundred  atoms,  which  is  far  below  the 
size  of  new  emerging  sensing  devices. 


Table  1:  notations  used  in  this  paper. 
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d 

AT 

At 

A“t 


L,  D,  U 

aij 

A(il  :  i2,j) 


A(i,  jl  :  j 2) 


A(il:i2,jl:j2) 


I 

0 


matrix 

number  of  blocks  in  A 
block  size 
transpose  matrix 
transpose  conjugate  matrix 
transpose  conjugate  of  the  in¬ 
verse  matrix 
LDU  factorization  of  A 
block  entry  (i,j)  of  matrix  A 
block  column  vector  containing 
rows  i  1  through  i2 
in  column  j 

block  row  vector  containing 
columns  jl  through  j 2 
in  row  i 

sub- matrix  containing  rows  il 
through  i2  and 
columns  jl  through  j 2 
Identity  matrix 
All-zero  matrix 


In  order  to  expand  the  current  simulation  capa¬ 
bilities,  we  have  developed  a  novel  approach  to  solve 
Equation  (1)  on  parallel  computers.  In  principle,  if  H 
has  size  N  x  A,  the  cost  of  computing  Gr  and  G<  is 
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0(N3).  However,  we  only  need  the  diagonal  entries 
of  Gv  and  GK.  We  demonstrated  that  these  entries 
can  be  computed  in  O(N)  steps  only [4].  This  algo¬ 
rithm  will  be  described  first.  Then  we  will  present  our 
parallelization  strategy  and  numerical  results. 


2  FAST  RECURRENCE  TO  COMPUTE 
THE  GREEN’S  FUNCTION 


backward  recurrence  formulas.  These  formulas  can  be 
obtained  by  considering  step  i  of  the  LDU  factoriza¬ 
tion  of  A,  which  has  the  following  form: 


A  = 


L(1  :  i  —  1, 1  :  i  —  1) 
L (i  :  n,  1  :  i  —  1) 


0 

I 


x 


D(1  :  i  —  1, 1  :  i  —  1) 

0 


0 

s* 


U(1  :  i  —  1, 1  :  i  —  1)  U(1  :  i  —  1,  i  :  n) 

0  I 


(4) 


Consider  a  general  matrix  A  written  in  block  form: 


An 

Ala 

A2i 

A22 

(2) 


The  L  and  U  factors  in  a  Gaussian  elimination  are 
given  by:  Ls  =  A2iA^11  and  Us  =  A^11Ai2.  The 
Schur  complement  is  S  =  A22  —  A2iAj“11  Ai2.  The  fol¬ 
lowing  equation  holds  for  the  inverse  matrix  G  =  A-1: 


G  = 


+  Us  S-1  Ls 
— S-1  Ls 


-uss-r 

s-1 


(3) 


This  equation  can  be  verified  by  direct  multiplication 
by  A.  It  allows  computing  the  inverse  matrix  using 


From  Equation  (4),  the  first  step  of  the  LDU  factor¬ 
ization  of  Sz  is  the  same  as  the  i  +  1th  step  in  the 
factorization  of  A: 


S 


1  0 

L (i  +  1  :  n,  i)  I 


0 

si+ 1 


1  U(z,  i  +  1  :  n) 

0  I 


(5) 


Combining  Equation  (5)  and  (3)  applied  to  A  =  S% 
we  arrive  at: 


[ST1 


d^1  +  U(i,  i  +  l:n)  [S^1]"1  L(i  +  1  :  n,  i)  -U (i,  i  +  1  :  n)  [Si+1]-r 
-;S,  !  ']' 1  L(/-  1  :  n,i)  [S^1]"1 


From  Equation  (3),  we  have: 

[S2]-1  =  G(i  :  n, i  :  n) 

and 

[S2+1]_1  =  G(i  +  1  :  n, i  +  1  :  n) 


matrix  G  is  full,  we  do  not  need  to  calculate  all  the 
entries[5,  6,  7].  We  denote  Lsym  and  Usym  the  lower 
and  upper  triangular  factors  in  the  symbolic  factor¬ 
ization  of  A.  The  non-zero  structure  of  (Lsym)T  and 
(Usym)r  is  denoted  by  Q :  this  is  the  set  of  all  pairs 
(i,  j)  such  that  l^m  7^  0  or  u^m  7^  0.  Then: 


We  have  therefore  proved  the  following  backward  re¬ 
currence  relations: 

G(i  +  1  :  n,  i)  = 

—  G(i  +  1  :  n,  i  +  1  :  n)  L (i  +  1  :  n,  i) 

G(i,  i  T  1  :  n )  = 

—  U(i,  i  +  1  :  n)  G(i  +  1  :  n,  i  +  1  :  n)  (6) 

= 

d^1  +  U(i,  i  +  1  :n)  G(i  +  1  :  n,  i  +  1  :  n) 

L (i  +  1  :  n,  i) 

starting  from  gnn  =  .  A  recurrence  for  i  =  n  —  1 

down  to  i  =  1  can  therefore  be  used  to  calculate  G. 
See  Fig.  1. 


( -  E  ^ 


/c>z,  ( k,j)eQ 

du1  +  E  Uifc 

k>i ,  l>i ,  (k,l)EQ 


hi 


if  i>  j 

if  i  <  j 
if  i  =  j 


(7) 


where  (i,  j)  G  Q. 


Proof.  Assume  that  (i,  j)  G  Q,  i  >  j,  then 
u^m  7^  0.  Assume  also  that  l®Jm  7^  0,  l  >  j,  then 
by  properties  of  the  Gaussian  elimination  (i,  Z)  G  Q. 
This  proves  the  first  case.  The  second  case  can  be 
proved  similarly. 


By  itself,  these  recurrence  formulas  do  not  lead 
to  any  reduction  in  the  computational  cost.  How¬ 
ever  the  following  result  shows  that  even  though  the 


For  the  third  case,  assume  that  u^m  7^  0  and 
zfz  0,  then  by  properties  of  the  Gaussian  elimi¬ 
nation  (fc,  /)  G  Q.  □ 
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Fig.  1:  Schematics  of  how  the  recurrence  formulas  are 
used  to  calculate  G. 


This  implies  that  we  do  not  need  to  calculate  all 
the  entries  in  G  but  rather  only  the  entries  in  G  corre¬ 
sponding  to  indices  in  Q.  This  represents  a  significant 
reduction  in  computational  cost.  Using  this  algorithm, 
the  cost  of  computing  entries  in  G  in  set  Q  is  the  same 
(up  to  a  constant  factor)  as  the  LDU  factorization. 


3  Parallel  Algorithm 


For  a  nanowire  or  nanotube  type  structure,  the  matrix 
H-eS  can  be  written  as  a  block  tri-diagonal  matrix. 
Even  though  our  algorithm  can  be  extended  to  general 
sparse  matrices,  in  with  work,  we  focus  on  computing 
Gr  for  tri-diagonal  block  matrices  to  simplify  the  dis¬ 
cussion.  The  matrix  A  is  assumed  to  be  an  n  x  n  block 
matrix,  and  in  block  tridiagonal  form  as  shown  by 


A  = 


( an  aL2 

a21  a22  a23 

a32  a33  a34 

V  ••  •• 


(8) 


■J 


where  each  block  element  a ^  is  a  dense  complex  ma¬ 
trix.  In  order  to  develop  a  parallel  algorithm,  we  as¬ 
sume  that  we  have  at  our  disposal  a  total  of  V  pro¬ 
cesses,  each  labeled  po,  pi,  . . . ,  pv-i-  The  block  tridi¬ 
agonal  structure  A  is  then  distributed  among  these 
processes  in  a  row-striped  manner,  as  illustrated  in 
Fig.  2. 


Po 

Pi 

P2 

P3 


P  0 


Pi 

P2 

P3 


Fig.  2:  Two  different  block  tridiagonal  matrices 
distributed  among  4  different  processes,  labeled  po> 
Pi,  P2  and  p3. 


Thus  each  process  is  assigned  ownership  of  certain 
contiguous  rows  of  A.  This  ownership  arrangement, 
however,  also  extends  to  the  calculated  blocks  of  the 
inverse  G,  as  well  as  any  LU  factors  determined  during 
calculation  in  L  and  U.  The  manner  of  distribution  is 
an  issue  of  load  balancing,  and  will  be  addressed  later 
in  the  paper. 

Furthermore,  for  illustrative  purposes,  we  present 
the  block  matrix  structures  as  having  identical  block 
sizes  throughout.  The  algorithm  presented  has  no  con¬ 
dition  for  the  uniformity  of  block  sizes  in  A,  and  can  be 
applied  to  a  block  tridiagonal  matrix  as  presented  on 
the  right  in  Fig.  2.  Block  sizes  throughout  A,  however, 
does  have  consequences  for  adequate  load  balancing. 

For  the  purposes  of  electronic  structure  applica¬ 
tions,  we  only  require  the  portion  of  the  inverse  G 
with  the  same  block  tridiagonal  structure  structure  of 
A.  We  express  this  portion  as  TridA  {G}. 

The  parallel  algorithm  is  a  hybrid  technique  in  the 
sense  that  it  combines  the  techniques  of  cyclic  reduc¬ 
tion  and  Schur  block  decomposition,  but  where  we  now 
consider  individual  elements  to  be  dense  matrix  blocks. 
The  steps  taken  by  the  hybrid  algorithm  to  produce 
TridA  {G}  is  outlined  in  Fig.  3. 
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The  algorithm  begins  with  our  block  tridiagonal 
matrix  A  partitioned  across  a  number  of  processes,  as 


indicated  by  the  dashed  horizontal  lines.  Each  process 
then  performs  what  is  equivalent  to  a  Schur  decom¬ 
position  on  the  rows/blocks  that  it  owns,  leaving  us 
with  a  reduced  system  that  is  equivalent  to  a  smaller 
block  tridiagonal  matrix  ASchur.  This  phase,  which  we 
name  the  Schur  reduction  phase,  is  entirely  devoid  of 
interprocess  communication. 

It  is  on  this  smaller  block  tridiagonal  structure 
for  which  we  perform  block  cyclic  reduction  (BCR), 
leaving  us  with  a  single  block  of  the  inverse,  gkk-  This 
block  cyclic  reduction  phase  involves  interprocess  com¬ 


munication. 

From  gkk  we  then  produce  the  portion  of  the  in¬ 
verse  corresponding  to  GBCR  =  TridAschur  {G}  in 
what  we  call  the  block  cyclic  production  phase.  Fi¬ 
nally,  using  Gbcr,  we  can  then  determine  the  full 
tridiagonal  structure  of  G  that  we  desire  without  any 
further  need  for  interprocess  communication  through 
a  so-called  Schur  production  phase.  The  block  cyclic 
production  phase  and  Schur  production  phase  are  a 
parallel  implementation  of  the  backward  recurrences 
in  Equation  (6). 


Fig.  3:  This  figure  illustrates  the  distinct  phases  of  operations  performed  by  the  hybrid  method  in  determining 
TridA  {G},  the  block  tridiagonal  portion  of  G  with  respect  to  the  structure  of  A.  The  block  matrices  in  this 
example  are  partitioned  across  4  processes,  as  indicated  by  the  horizontal  dashed  lines. 


Assume  now  that  we  have  P  processors  at  our  dis¬ 
posal.  The  N  blocks  can  be  split  into  P  clusters  of 
contiguous  blocks,  each  assigned  to  a  processor.  The 
parallel  algorithm  is  decomposed  into  3  steps.  Step  1 
is  an  embarrassingly  parallel  step  (it  requires  no  com¬ 
munication)  and  consists  in  eliminating  all  the  blocks 
inside  the  clusters  except  the  block  at  the  right  end  of 
the  cluster.  Step  2  requires  2  log2  P  passes  and  involves 
communication.  Step  3  produces  all  the  diagonal  en¬ 
tries  of  Gr  and  G<  using  a  sequence  of  operations 
similar  to  Step  1,  again  requiring  no  communication. 


4  NUMERICAL  RESULTS 

By  performing  an  operation  count  under  the  assump¬ 
tion  of  equal  block  sizes  throughout  A,  and  that  an 
LU  factorization  costs  on  the  order  of  |  d3  operations 
and  a  matrix-matrix  multiplication  costs  2 d3  opera¬ 
tions  for  a  matrix  of  dimension  d,  we  can  estimate  the 
ratio  a  of  number  of  rows  assigned  to  the  first  and  last 
processes  vs.  central  processes  (see  Fig.  2).  An  anal¬ 
ysis  of  our  implemented  algorithm  predicts  a  =  2.636 
to  be  an  optimal  choice.  For  the  sake  of  completeness, 
we  investigate  the  case  of  a  =  1,  where  each  process 
is  assigned  the  same  number  of  rows,  while  the  values 
a  =  2  and  a  =  3  are  chosen  to  bracket  the  optimal 
choice. 

Execution  time  was  measured  on  implemented  ver- 
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sions  of  a  pure  block  cyclic  reduction  algorithm  and  of 
our  hybrid  algorithm.  The  wall  time  measurements 
for  running  these  algorithms  on  a  block  tridiagonal 
matrix  A  with  n  —  512  block  rows  with  blocks  of  di¬ 
mension  m  =  256  are  given  in  Fig.  4[a-d]  for  the  four 
different  load  balancing  values  of  a  =  {1,2,2.636,3}. 
A  total  number  of  processors  used  for  execution  was 
V  =  {1,  2, 4,  8, 16,  32,  64}  in  all  cases. 


Fig.  4d:  a  =  3. 


This  figure  illustrates  the  wall  time  curves  for  the 
application  of  our  hybrid  algorithm  and  pure  BCR 
under  differing  values  of  a  as  a  basic  load  balancing 
parameter.  The  curves  are  produced  by  applying  the 
different  algorithms  on  a  block  tridiagonal  matrix  A 
with  n  =  512  diagonal  blocks,  each  of  dimension 
m  =  256. 


The  speedup  results  for  the  corresponding  wall 
time  measurements  are  given  in  Fig.  5[a-d]  for  the 
four  different  load  balancing  values  of  a.  For  a  uni¬ 
form  distribution  where  a  =  1,  a  serious  performance 
hit  is  experienced  for  V  =  4.  This  is  due  to  a  poor 
load  balance,  as  the  first  and  last  processes  po  and 
Ps  terminate  their  Schur  production/reduction  phases 
much  sooner  than  the  central  processes  pi  and  p2-  It 
is  then  observed  that  choosing  a  basic  load  balancing 
parameter  of  a  =  2,  2.636  or  3  eliminates  this  dip  in 
speedup. 


Fig.  5a:  a  =  1. 
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Fig.  5b:  a  =  2. 


Fig.  5c:  a  =  2.636. 


Fig.  5d:  a  =  3. 


This  figure  illustrates  the  speedup  curves  for  the 
application  of  our  hybrid  algorithm  and  pure  BCR 
under  differing  values  of  a  as  a  basic  load  balancing 
parameter.  The  curves  are  produced  by  applying  the 
different  algorithms  on  a  block  tridiagonal  matrix  A 
with  n  =  512  diagonal  blocks,  each  of  dimension 
m  =  256. 


Fig.  6:  The  total  execution  time  for  our  hybrid 
algorithm  is  plotted  against  the  number  of  processes 
V.  The  curves  were  produced  for  an  example  block 
tridiagonal  matrix  A  with  n  =  512  diagonal  blocks, 
each  of  dimension  m  =  256. 


It  can  also  be  seen  that  as  the  number  of  pro¬ 
cessors  V  increases  for  a  fixed  n,  a  greater  portion 
of  execution  time  is  attributed  to  the  BCR  phase  of 
our  hybrid  algorithm.  Ultimately  higher  communica¬ 
tions  and  computational  costs  of  the  BCR  over  the 
embarrassingly  parallel  Schur  phase  of  the  algorithm 
dominate,  and  the  speedup  curves  level  off  and  drop, 
regardless  of  a. 

In  an  effort  to  determine  which  load  balancing  pa¬ 
rameter  a  is  optimal,  we  compare  the  wall  time  execu¬ 
tion  measurements  for  our  hybrid  algorithm  in  Fig.  6. 
From  this  figure,  we  can  conclude  that  a  uniform  distri¬ 
bution  with  a  =  1  leads  to  generally  poorer  execution 
times.  Other  values  lead  to  improved  execution  times 
over  a  =  1,  particularly  in  the  range  V  =  4  . . .  8,  which 
is  a  common  core  count  in  high-end  desktop  comput¬ 
ers.  This  suggests  that  a  certain  amount  of  flexibil¬ 
ity  can  be  afforded  in  deviating  from  the  theoretically 
ideal  load  balancing  of  a  =  2.636. 


5  CONCLUSION 


We  have  presented  a  new  algorithm  and  parallelization 
strategy  to  calculate  the  diagonal  entries  of  Green’s 
functions.  This  has  application  in  quantum  trans¬ 
port  in  nano-devices,  for  example  nano-sensors  using 
nanowires  and  carbon  nanotubes.  An  optimization 
strategy  has  been  proposed  to  improve  the  load  bal¬ 
ancing  and  the  speed-up.  The  speed-up  improves  as 
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the  length  of  the  device  increases  (parameter  n).  Fur¬ 
ther  improvements  can  be  obtained  by  employing  a 
different  strategy  to  parallelize  the  BCR  part  of  the 
algorithm  which  allows  reducing  the  amount  of  com¬ 
munication  by  reducing  the  number  of  passes  in  the 
algorithm.  This  will  be  presented  in  a  future  paper. 
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